Solvent Coefficients For Alternative 'Safe' Solvents
Researchers: Jean-Claude Bradley and Andrew SID Lang
All content, models and data are released as CC0 - the default license for all our ONS work.
This page is a duplicate (backup) of the original ONSChallenge page AbrahamSolventModel003
Objective
To investigate the predicted solvation properties of solvents deemed safe by the EPA. These solvents will be compared to solvents with known Abraham solvent coefficients, with two aims: to identify safer replacements for existing solvents, and to find potential new safe solvents that reside in a part of the chemical space not currently occupied by solvents with known Abraham coefficients.
Procedure
The Abraham general solvation model uses the linear free energy relationship (LFER)
log P = c + e E + s S + a A + b B + v V
where c, e, s, a, b, v are the solvent coefficients and E, S, A, B, V are the solute descriptors; see this brief discussion of the model. The Abraham coefficients are found via linear regression from measured data. The standard procedure is to allow the c-coefficient (the intercept) to float in the linear regression. It has been suggested that c should not be negative[1]. We suggest that little predictive ability will be lost if we simply require c to be zero, which also makes solvents easier to compare. Thus, in order to compare current solvents with each other and potential new solvents with current solvents, we re-calculated the coefficients for known solvents with c fixed at zero, giving e_0, s_0, a_0, b_0, v_0. This was achieved by calculating the log P values in 90 solvents for 2144 compounds with known Abraham descriptors from our Open Abraham Descriptors Database and then re-running the linear regression in R. The following code with results is typical:
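setwd(".../MakingCZero")
## load the calculated log P values and the descriptors, with rows named by the csid column
mydata <- read.csv(file = "makingczeroreadyforR.csv", header = TRUE, row.names = "csid")
## "0 +" in the formula drops the intercept, i.e. forces c = 0
fit <- lm(isopropyl.myristate ~ 0 + E + S + A + B + V, data = mydata)
## summary of fit
summary(fit)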
[output]
Call:
lm(formula = isopropyl.myristate ~ 0 + E + S + A + B + V, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-0.55191 -0.25598 -0.13732 0.00069 1.78549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
E 0.977259 0.011781 82.95 <2e-16 ***
S -1.294959 0.014814 -87.41 <2e-16 ***
A -1.870114 0.020493 -91.26 <2e-16 ***
B -4.017729 0.015120 -265.73 <2e-16 ***
V 3.939081 0.007844 502.19 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2503 on 2139 degrees of freedom
Multiple R-squared: 0.9958, Adjusted R-squared: 0.9958
F-statistic: 1.009e+05 on 5 and 2139 DF, p-value: < 2.2e-16
[output]
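To make the role of the coefficients concrete, the following sketch evaluates the LFER for a single solute in isopropyl myristate, once with the original coefficients listed in this page's coefficient table and once with the c = 0 coefficients from the fit above. The solute descriptor values and the helper name abraham_logp are hypothetical, chosen only for illustration.

```r
# Abraham LFER: log P = c + e*E + s*S + a*A + b*B + v*V
# abraham_logp is an illustrative helper, not part of the original workflow.
abraham_logp <- function(coef, descriptors) {
  # coef: named vector (c, e, s, a, b, v); descriptors: named vector (E, S, A, B, V)
  terms <- coef[c("e", "s", "a", "b", "v")] * descriptors[c("E", "S", "A", "B", "V")]
  unname(coef["c"] + sum(terms))
}

# Isopropyl myristate: original coefficients and the c = 0 refit
original <- c(c = -0.61, e = 0.93, s = -1.15, a = -1.68, b = -4.09, v = 4.25)
adjusted <- c(c = 0, e = 0.977259, s = -1.294959, a = -1.870114,
              b = -4.017729, v = 3.939081)

# Hypothetical solute descriptors (assumed values, for illustration only)
solute <- c(E = 0.80, S = 0.90, A = 0.30, B = 0.50, V = 1.00)

logp_original <- abraham_logp(original, solute)
logp_adjusted <- abraham_logp(adjusted, solute)
print(c(original = logp_original, adjusted = logp_adjusted))
```

Comparing the two numbers for a range of solutes is one simple way to see how much predictive ability is given up by fixing c at zero.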
The following table lists the original solvent coefficients together with the c=0 adjusted coefficients. Not surprisingly, the largest changes in coefficient values occur for solvents with c-values furthest from zero. What is a little intriguing is that all the coefficients move consistently in the same way. That is, solvents with negative c-values all saw an increase in e and b (and a decrease in s, a, and v) when the recalculation was performed, whereas solvents with positive c-values all saw an increase in s, a, and v (and a decrease in e and b). Multiplying the average absolute deviation of a coefficient by the average value of its descriptor gives a measure of the degree to which that coefficient changed. By this measure (e.g. AAE(v_0) * Mean(V)), the adjusted coefficients changed in the order v (0.124), s (0.043), e (0.013), b (0.011), a (0.010).
| c | e | s | a | b | v | solvent | e_0 | s_0 | a_0 | b_0 | v_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.17 | 0.4 | -1.01 | 0.06 | -3.96 | 4.04 | 1-butanol | 0.387596 | -0.97209 | 0.108258 | -3.97885 | 4.12895 |
| 0.22 | 0.27 | -0.57 | -2.92 | -4.88 | 4.46 | 1-chlorobutane | 0.254833 | -0.516892 | -2.847674 | -4.910816 | 4.57048 |
| -0.06 | 0.62 | -1.32 | 0.03 | -4.15 | 4.28 | 1-decanol | 0.6203042 | -1.3327395 | 0.0090799 | -4.1464701 | 4.2497972 |
| 0.04 | 0.4 | -1.06 | 0 | -4.34 | 4.32 | 1-heptanol | 0.3948325 | -1.0545913 | 0.0143256 | -4.3468223 | 4.3352649 |
| 0.12 | 0.71 | -1.62 | -3.18 | -4.8 | 4.32 | 1-hexadecene | 0.696611 | -1.588924 | -3.143734 | -4.810655 | 4.38202 |
| 0.12 | 0.49 | -1.16 | 0.05 | -3.98 | 4.13 | 1-hexanol | 0.482763 | -1.137261 | 0.090893 | -3.993082 | 4.190662 |
| -0.03 | 0.49 | -1.04 | -0.02 | -4.24 | 4.22 | 1-octanol | 0.4909845 | -1.0517092 | -0.0336817 | -4.2309735 | 4.2009611 |
| 0.15 | 0.54 | -1.23 | 0.14 | -3.86 | 4.08 | 1-pentanol | 0.52385 | -1.193867 | 0.188379 | -3.88273 | 4.154244 |
| 0.14 | 0.41 | -1.03 | 0.25 | -3.77 | 3.99 | 1-propanol | 0.393466 | -0.996062 | 0.291203 | -3.784515 | 4.05757 |
| 0.18 | 0.29 | -0.13 | -2.8 | -4.29 | 4.18 | 1,2-dichloroethane | 0.278894 | -0.090721 | -2.742532 | -4.313978 | 4.274203 |
| 0.12 | 0.35 | -0.03 | -0.58 | -4.81 | 4.11 | 1,4-dioxane | 0.33675 | -0.003994 | -0.542209 | -4.825876 | 4.173499 |
| 0.1 | 0.62 | -1.8 | -3.07 | -4.29 | 4.52 | 1,9-decadiene | 0.606347 | -1.771424 | -3.037034 | -4.304097 | 4.571908 |
| 0.13 | 0.25 | -0.98 | 0.16 | -3.88 | 4.11 | 2-butanol | 0.242337 | -0.945988 | 0.198721 | -3.897972 | 4.179391 |
| 0.19 | 0.35 | -1.13 | 0.02 | -3.57 | 3.97 | 2-methyl-1-propanol | 0.338508 | -1.082867 | 0.07594 | -3.591543 | 4.064762 |
| 0.21 | 0.17 | -0.95 | 0.33 | -4.09 | 4.11 | 2-methyl-2-propanol | 0.153709 | -0.897144 | 0.397899 | -4.111567 | 4.217508 |
| 0.12 | 0.46 | -1.33 | 0.21 | -3.75 | 4.2 | 2-pentanol | 0.445389 | -1.303994 | 0.243289 | -3.759434 | 4.260301 |
| 0.1 | 0.34 | -1.05 | 0.41 | -3.83 | 4.03 | 2-propanol | 0.334933 | -1.025916 | 0.437909 | -3.839249 | 4.084005 |
| 0.32 | 0.51 | -1.69 | -3.69 | -4.81 | 4.4 | 2,2,4-trimethylpentane | 0.485433 | -1.610035 | -3.586041 | -4.850718 | 4.563507 |
| 0.07 | 0.36 | -1.27 | 0.09 | -3.77 | 4.4 | 3-methyl-1-butanol | 0.35352 | -1.255543 | 0.11342 | -3.779382 | 4.43679 |
| 0.31 | 0.31 | -0.12 | -0.61 | -4.75 | 3.94 | acetone | 0.286706 | -0.04747 | -0.508846 | -4.792269 | 4.102844 |
| 0.41 | 0.08 | 0.33 | -1.57 | -4.39 | 3.36 | acetonitrile | 0.044129 | 0.423135 | -1.4362 | -4.443325 | 3.576285 |
| 0.14 | 0.46 | -0.59 | -3.01 | -4.63 | 4.49 | benzene | 0.452175 | -0.554143 | -2.963555 | -4.643338 | 4.564318 |
| 0.1 | 0.29 | 0.06 | -1.61 | -4.56 | 4.03 | benzonitrile | 0.277045 | 0.081936 | -1.574291 | -4.574904 | 4.07839 |
| -0.02 | 0.44 | -0.42 | -3.17 | -4.56 | 4.45 | bromobenzene | 0.4369346 | -0.4279276 | -3.1781219 | -4.5563083 | 4.4367652 |
| 0.25 | 0.26 | -0.08 | -0.77 | -4.86 | 4.15 | butanone | 0.236179 | -0.022077 | -0.68909 | -4.886268 | 4.274532 |
| 0.25 | 0.36 | -0.5 | -0.87 | -4.97 | 4.28 | butyl acetate | 0.336024 | -0.442788 | -0.788152 | -5.004535 | 4.408761 |
| 0.05 | 0.69 | -0.94 | -3.6 | -5.82 | 4.92 | carbon disulfide | 0.6819348 | -0.9318396 | -3.5870443 | -5.8248542 | 4.9458403 |
| 0.2 | 0.52 | -1.16 | -3.56 | -4.59 | 4.62 | carbon tetrachloride | 0.506806 | -1.112282 | -3.496545 | -4.619008 | 4.720644 |
| 0.07 | 0.38 | -0.52 | -3.18 | -4.7 | 4.61 | chlorobenzene | 0.3752336 | -0.5056045 | -3.1613969 | -4.7083071 | 4.6478423 |
| 0.19 | 0.11 | -0.4 | -3.11 | -3.51 | 4.4 | chloroform | 0.089413 | -0.357874 | -3.051291 | -3.537934 | 4.493193 |
| 0.16 | 0.78 | -1.68 | -3.74 | -4.93 | 4.58 | cyclohexane | 0.770769 | -1.640414 | -3.689346 | -4.948869 | 4.659072 |
| 0.04 | 0.23 | 0.06 | -0.98 | -4.84 | 4.32 | cyclohexanone | 0.2216766 | 0.0668337 | -0.962833 | -4.8469389 | 4.3348761 |
| 0.19 | 0.72 | -1.74 | -3.45 | -4.97 | 4.48 | decane | 0.706653 | -1.697274 | -3.389851 | -4.993254 | 4.571974 |
| 0.18 | 0.39 | -0.99 | -1.41 | -5.36 | 4.52 | dibutyl ether | 0.37973 | -0.94367 | -1.357836 | -5.379393 | 4.614845 |
| 0.33 | 0.3 | -0.44 | 0.36 | -4.9 | 3.95 | dibutylformamide | 0.275223 | -0.3577 | 0.462459 | -4.944045 | 4.122737 |
| 0.32 | 0.1 | -0.19 | -3.06 | -4.09 | 4.32 | dichloromethane | 0.076325 | -0.111839 | -2.957248 | -4.130085 | 4.487963 |
| 0.35 | 0.36 | -0.82 | -0.59 | -4.96 | 4.35 | diethyl ether | 0.329941 | -0.737491 | -0.477868 | -5.000297 | 4.530001 |
| 0.21 | 0.03 | 0.09 | 1.34 | -5.08 | 4.09 | diethylacetamide | 0.0167 | 0.139253 | 1.409338 | -5.111165 | 4.197615 |
| -0.27 | 0.08 | 0.21 | 0.92 | -5 | 4.56 | dimethylacetamide | 0.104844 | 0.145499 | 0.831639 | -4.970213 | 4.418588 |
| -0.31 | -0.06 | 0.34 | 0.36 | -4.87 | 4.49 | DMF | -0.034208 | 0.27126 | 0.264139 | -4.827615 | 4.330082 |
| -0.19 | 0.33 | 0.79 | 1.26 | -4.54 | 3.36 | DMSO | 0.341749 | 0.745718 | 1.200253 | -4.516908 | 3.262081 |
| 0.11 | 0.67 | -1.64 | -3.55 | -5.01 | 4.46 | dodecane | 0.658583 | -1.616828 | -3.508611 | -5.020509 | 4.517868 |
| 0.22 | 0.47 | -1.04 | 0.33 | -3.6 | 3.86 | ethanol | 0.453079 | -0.983328 | 0.396198 | -3.623493 | 3.971272 |
| -0.17 | -0.02 | 0 | 0.07 | -0.37 | 0.45 | ethanol/water(10:90)vol | -0.009316 | -0.04156 | 0.010626 | -0.350331 | 0.365271 |
| -0.25 | 0.04 | -0.04 | 0.1 | -0.83 | 0.92 | ethanol/water(20:80)vol | 0.062777 | -0.098993 | 0.01722 | -0.800521 | 0.786631 |
| -0.27 | 0.11 | -0.1 | 0.13 | -1.32 | 1.41 | ethanol/water(30:70)vol | 0.127804 | -0.160996 | 0.049426 | -1.282852 | 1.276293 |
| -0.22 | 0.13 | -0.16 | 0.17 | -1.81 | 1.92 | ethanol/water(40:60)vol | 0.148332 | -0.210817 | 0.102648 | -1.781936 | 1.804907 |
| -0.14 | 0.12 | -0.25 | 0.25 | -2.28 | 2.42 | ethanol/water(50:50)vol | 0.134901 | -0.285203 | 0.207128 | -2.257463 | 2.342294 |
| -0.04 | 0.14 | -0.34 | 0.29 | -2.68 | 2.81 | ethanol/water(60:40)vol | 0.1406465 | -0.3442878 | 0.281256 | -2.6697239 | 2.7916452 |
| 0.06 | 0.09 | -0.37 | 0.31 | -2.94 | 3.1 | ethanol/water(70:30)vol | 0.0794107 | -0.3529953 | 0.331463 | -2.9438845 | 3.1344568 |
| 0.17 | 0.18 | -0.47 | 0.26 | -3.21 | 3.32 | ethanol/water(80:20)vol | 0.161026 | -0.424495 | 0.314211 | -3.233463 | 3.411426 |
| 0.24 | 0.21 | -0.58 | 0.26 | -3.45 | 3.55 | ethanol/water(90:10)vol | 0.193477 | -0.51767 | 0.338521 | -3.480739 | 3.669878 |
| 0.33 | 0.37 | -0.45 | -0.7 | -4.9 | 4.15 | ethyl acetate | 0.342809 | -0.369036 | -0.596948 | -4.94523 | 4.318697 |
| 0.09 | 0.47 | -0.72 | -3 | -4.84 | 4.51 | ethylbenzene | 0.459437 | -0.701228 | -2.970828 | -4.855741 | 4.56218 |
| -0.27 | 0.58 | -0.51 | 0.72 | -2.62 | 2.73 | ethylene glycol | 0.599449 | -0.574819 | 0.631321 | -2.585314 | 2.5908 |
| 0.14 | 0.15 | -0.37 | -3.03 | -4.6 | 4.54 | fluorobenzene | 0.140337 | -0.340978 | -2.985464 | -4.618238 | 4.611483 |
| -0.17 | 0.07 | 0.31 | 0.59 | -3.15 | 2.43 | formamide | 0.083307 | 0.267828 | 0.536608 | -3.131516 | 2.344771 |
| 0.3 | 0.64 | -1.76 | -3.57 | -4.95 | 4.49 | heptane | 0.61919 | -1.685129 | -3.477189 | -4.983132 | 4.640776 |
| 0.09 | 0.67 | -1.62 | -3.59 | -4.87 | 4.43 | hexadecane | 0.659893 | -1.59632 | -3.559573 | -4.880281 | 4.47815 |
| 0.33 | 0.56 | -1.71 | -3.58 | -4.94 | 4.46 | hexane | 0.53342 | -1.631725 | -3.473425 | -4.980698 | 4.634317 |
| -0.19 | 0.3 | -0.31 | -3.21 | -4.65 | 4.59 | iodobenzene | 0.312539 | -0.352762 | -3.271785 | -4.629052 | 4.489752 |
| -0.61 | 0.93 | -1.15 | -1.68 | -4.09 | 4.25 | isopropyl myristate | 0.977259 | -1.294959 | -1.870114 | -4.017729 | 3.939081 |
| 0.12 | 0.38 | -0.6 | -2.98 | -4.96 | 4.54 | m-xylene | 0.366587 | -0.574078 | -2.941283 | -4.976929 | 4.598299 |
| 0.28 | 0.33 | -0.71 | 0.24 | -3.32 | 3.55 | methanol | 0.311909 | -0.649107 | 0.329542 | -3.354582 | 3.690751 |
| 0.35 | 0.22 | -0.15 | -1.04 | -4.53 | 3.97 | methyl acetate | 0.194997 | -0.067588 | -0.923983 | -4.571216 | 4.15239 |
| 0.34 | 0.31 | -0.82 | -0.62 | -5.1 | 4.43 | methyl tert-butyl ether | 0.279699 | -0.737134 | -0.510026 | -5.139775 | 4.600429 |
| 0.25 | 0.78 | -1.98 | -3.52 | -4.29 | 4.53 | methylcyclohexane | 0.762327 | -1.924196 | -3.439318 | -4.323834 | 4.654703 |
| 0.28 | 0.13 | -0.44 | 1.18 | -4.73 | 3.86 | N-ethylacetamide | 0.105071 | -0.374993 | 1.269385 | -4.764184 | 4.002187 |
| 0.22 | 0.03 | -0.17 | 0.94 | -4.59 | 3.73 | N-ethylformamide | 0.016449 | -0.114321 | 1.004651 | -4.616979 | 3.84333 |
| -0.03 | 0.7 | -0.06 | 0.01 | -4.09 | 3.41 | N-formylmorpholine | 0.6981457 | -0.0694897 | 0.0048883 | -4.0885654 | 3.38906 |
| 0.06 | 0.33 | 0.26 | 1.56 | -5.04 | 3.98 | N-methyl-2-piperidone | 0.3271873 | 0.2705115 | 1.5746338 | -5.0436057 | 4.0124292 |
| 0.09 | 0.21 | -0.17 | 1.31 | -4.59 | 3.83 | N-methylacetamide | 0.19721 | -0.150831 | 1.334533 | -4.600626 | 3.879615 |
| 0.11 | 0.41 | -0.29 | 0.54 | -4.09 | 3.47 | N-methylformamide | 0.397604 | -0.260136 | 0.578616 | -4.099689 | 3.529845 |
| 0.15 | 0.53 | 0.23 | 0.84 | -4.79 | 3.67 | N-methylpyrrolidinone | 0.519565 | 0.259902 | 0.887089 | -4.813222 | 3.749914 |
| -0.2 | 0.54 | 0.04 | -2.33 | -4.61 | 4.31 | nitrobenzene | 0.551741 | -0.003723 | -2.388352 | -4.584066 | 4.213974 |
| 0.02 | -0.09 | 0.79 | -1.46 | -4.36 | 3.46 | nitromethane | -0.0933342 | 0.7985957 | -1.4544755 | -4.3676129 | 3.4722537 |
| 0.24 | 0.62 | -1.71 | -3.53 | -4.92 | 4.48 | nonane | 0.599859 | -1.65665 | -3.456521 | -4.951366 | 4.605711 |
| 0.08 | 0.52 | -0.81 | -2.88 | -4.82 | 4.56 | o-xylene | 0.511059 | -0.793315 | -2.857401 | -4.831364 | 4.601817 |
| -0.1 | 0.15 | -0.84 | -0.44 | -4.04 | 4.13 | octadecanol | 0.155261 | -0.863525 | -0.466854 | -4.028093 | 4.075935 |
| 0.23 | 0.74 | -1.84 | -3.59 | -4.91 | 4.5 | octane | 0.719433 | -1.785636 | -3.512058 | -4.936095 | 4.620999 |
| 0.17 | 0.48 | -0.81 | -2.94 | -4.87 | 4.53 | p-xylene | 0.463092 | -0.772761 | -2.885801 | -4.895116 | 4.617725 |
| 0.57 | 0.72 | -1.03 | -1.3 | -4.51 | 3.45 | peanut oil | 0.66965 | -0.89221 | -1.12075 | -4.58151 | 3.74435 |
| 0.37 | 0.39 | -1.57 | -3.54 | -5.22 | 4.51 | pentane | 0.35651 | -1.481294 | -3.418818 | -5.261024 | 4.703599 |
| 0 | 0.17 | 0.5 | -1.28 | -4.41 | 3.42 | propylene carbonate | 0.1672359 | 0.505135 | -1.2809844 | -4.4080414 | 3.4234811 |
| 0 | 0.15 | 0.6 | -0.38 | -4.54 | 3.29 | sulfolane | 0.1468503 | 0.6009136 | -0.3799049 | -4.541574 | 3.2903215 |
| 0.22 | 0.36 | -0.38 | -0.24 | -4.93 | 4.45 | THF | 0.345051 | -0.331628 | -0.167145 | -4.96046 | 4.564853 |
| 0.13 | 0.43 | -0.64 | -3 | -4.75 | 4.52 | toluene | 0.420597 | -0.614527 | -2.961869 | -4.763681 | 4.588524 |
| 0.33 | 0.57 | -0.84 | -1.07 | -4.33 | 3.92 | tributyl phosphate | 0.543888 | -0.760593 | -0.965937 | -4.373768 | 4.087161 |
| 0.4 | -0.09 | -0.59 | -1.28 | -1.27 | 3.09 | trifluoroethanol | -0.125647 | -0.50143 | -1.155862 | -1.322677 | 3.290636 |
| 0.06 | 0.6 | -1.66 | -3.42 | -5.12 | 4.62 | undecane | 0.5979334 | -1.6471654 | -3.4017847 | -5.1276719 | 4.6493317 |
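The change measure described before the table (average absolute deviation of a coefficient times the mean of its descriptor) can be sketched as follows. Only four rows of the table are used, and mean_V is a hypothetical placeholder, since the descriptor means from the 2144-compound set are not listed on this page; the result is therefore not expected to reproduce the 0.124 reported for v.

```r
# Four rows from the table above: original v and c = 0 adjusted v_0
coefs <- data.frame(
  solvent = c("1-butanol", "acetone", "DMSO", "hexane"),
  v   = c(4.04, 3.94, 3.36, 4.46),
  v_0 = c(4.12895, 4.102844, 3.262081, 4.634317)
)

# Average absolute deviation between adjusted and original coefficient
aae_v <- mean(abs(coefs$v_0 - coefs$v))

# Hypothetical mean of the V descriptor (assumed value, for illustration only)
mean_V <- 1.5

# Change measure: AAE(v_0) * Mean(V)
change_v <- aae_v * mean_V
print(c(AAE = aae_v, change = change_v))
```

Repeating the same calculation for e, s, a and b with their own descriptor means puts the five coefficients on a comparable scale, which is what allows them to be ranked.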
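Using the above adjusted coefficients, new random forest (RF) models were created using R (v3.0.0) and Rajarshi Guha's CDK Descriptor Calculator (v1.3.9). First we used R to perform feature selection:
library(caret) #for feature selection
setwd(".../MakingCZero")
mydata = read.csv(file="CDKReady4RFeatureSelectection.csv",head=TRUE,row.names="Title")
ncol(mydata)
[output]
[1] 207
[output]
## remove zeros and other small variance columns
nzv <- nearZeroVar(mydata)
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
[1] 111
[output]
cor.mat = cor(mydata)
## find correlation r > 0.90
highCorr <- findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)
## remove the highly correlated columns
mydata <- mydata[, -highCorr]
ncol(mydata)
[output]
[1] 68
[output]
write.csv(mydata, file = "CDKFeatureSelected.csv")
Then the models themselves were created using code like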
library("randomForest") #for modeling
setwd(".../MakingCZero")
mydata = read.csv(file="CDKReady4Ra.csv",head=TRUE,row.names="Title")
mydata.rf <- randomForest(a_0 ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = a_0 ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 22
Mean of squared residuals: 0.2272567
% Var explained: 91.89
[output]
## get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## save the model
saveRDS(mydata.rf, file = "arfmodel")
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredicta.csv")
The models were used to predict the coefficients of the training set to examine if any of the solvents were outliers. This could indicate that certain solvent coefficients were in need of updating. The solvents which had the largest errors were (the first 5 being especially suspect): trifluoroethanol, carbon disulfide, formamide, isopropyl myristate, ethylene glycol, DMF, octadecanol, DMSO, chloroform, nitromethane, carbon tetrachloride, N-formylmorpholine, methylcyclohexane, sulfolane, N-methylacetamide.
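The outlier check described above amounts to ranking solvents by the absolute difference between the measured and the model-predicted coefficient. A minimal sketch for a_0 follows; the measured values are taken from the coefficient table, while the predicted values are hypothetical numbers invented for illustration and are not output from the actual models.

```r
# Adjusted a_0 values from the coefficient table
measured <- c(trifluoroethanol = -1.155862, formamide = 0.536608,
              toluene = -2.961869, ethanol = 0.396198)

# Hypothetical model predictions (invented for illustration only)
predicted <- c(trifluoroethanol = 0.10, formamide = 1.00,
               toluene = -2.90, ethanol = 0.35)

# Absolute prediction error per solvent, matched by name
abs_error <- abs(measured - predicted[names(measured)])

# Largest errors first: candidates whose coefficients may need updating
ranked <- sort(abs_error, decreasing = TRUE)
print(ranked)
```

With a randomForest fit, calling predict() without newdata returns out-of-bag predictions, which may give a less optimistic error ranking than predicting on the training rows themselves.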
EPA Solvents
SMILES for the potential new safe solvents were extracted from ChemSpider using the CAS numbers and names. Solvents that already have measured coefficients were excluded, as were Fatty acids (C16-18 and C18-unsatd., methyl esters), (Glycerides, mixed decanoyl and octanoyl), (Soybean oil, methyl esters), (Tripropylene glycol n-butyl ether), (White mineral oil, petroleum), (Fatty acids, C12-18, methyl esters), and Polypropylene glycol.
CDK descriptors were then calculated, which in turn allowed us to predict the solvent coefficients.
By calculating the distance to each solvent with known coefficients, sqrt(sum(((measured-predicted)/measuredSD)^2)), we identified possible solvent replacements that are predicted to have similar solvation properties:
| Current Solvent | Possible Alternate Solvent | Distance |
|---|---|---|
| 1-octanol | 1-dodecanol | 0.295 |
| ethanol | 1,3-butanediol | 0.576 |
| 1-propanol | 1,3-butanediol | 0.585 |
| acetone | propylene glycol methyl ether acetate | 0.499 |
| methyl acetate | propylene glycol methyl ether acetate | 0.502 |
| benzonitrile | propylene glycol methyl ether acetate | 0.576 |
| 1,4-dioxane | propylene glycol methyl ether acetate | 0.677 |
| methanol | 1,2-propanediol | 0.517 |
| methanol | 1-(2-methoxy-1-methylethoxy)-2-propanol | 0.574 |
| methanol | 1-methoxy-2-propanol | 0.617 |
| methanol | glycerol | 0.722 |
| 2,2,4-trimethylpentane | D-limonene | 0.619 |
| hexane | D-limonene | 0.621 |
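The distance used for the table above can be sketched as a scaled Euclidean distance over the five adjusted coefficients. This sketch assumes that the square applies to each scaled difference before summing, and that measuredSD is the per-coefficient standard deviation across the solvents with known coefficients; neither is spelled out on this page. The candidate's predicted coefficients are hypothetical, and only four known solvents are included.

```r
# Adjusted coefficients (e_0, s_0, a_0, b_0, v_0) for four known solvents
known <- rbind(
  "1-octanol" = c(0.4909845, -1.0517092, -0.0336817, -4.2309735, 4.2009611),
  "ethanol"   = c(0.453079, -0.983328, 0.396198, -3.623493, 3.971272),
  "acetone"   = c(0.286706, -0.04747, -0.508846, -4.792269, 4.102844),
  "hexane"    = c(0.53342, -1.631725, -3.473425, -4.980698, 4.634317)
)
colnames(known) <- c("e_0", "s_0", "a_0", "b_0", "v_0")

# Assumed scaling: standard deviation of each coefficient over the known solvents
sds <- apply(known, 2, sd)

# Scaled Euclidean distance between measured and predicted coefficient vectors
abraham_distance <- function(measured, predicted, measured_sd) {
  sqrt(sum(((measured - predicted) / measured_sd)^2))
}

# Hypothetical predicted coefficients for a candidate solvent (illustration only)
candidate <- c(e_0 = 0.48, s_0 = -1.10, a_0 = -0.05, b_0 = -4.20, v_0 = 4.25)

# Distance from the candidate to every known solvent; the smallest is the closest match
d <- apply(known, 1, abraham_distance, predicted = candidate, measured_sd = sds)
print(sort(d))
```

Dividing by the standard deviation keeps coefficients with a wide spread, such as a_0, from dominating the distance.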
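Principal component analysis, in R, was used to help visualize where both current and potential new solvents lie in the chemical space.
setwd(".../MakingCZero")
mydata = read.csv(file="EPA-PCA4R.csv",head=TRUE,row.names="Title")
pc1 <- prcomp(mydata, scale. = TRUE)
x <- pc1$x
summary(pc1)
[output]
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.7181 1.0367 0.7622 0.56519 0.2702
Proportion of Variance 0.5904 0.2150 0.1162 0.06389 0.0146
Cumulative Proportion 0.5904 0.8053 0.9215 0.98540 1.0000
[output]
Results
Solvents recommended to be updated with high priority: trifluoroethanol, carbon disulfide, formamide, isopropyl myristate, ethylene glycol.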
Possible alternative solvents
| Current Solvent | Possible Alternate Solvent | Distance |
|---|---|---|
| 1-octanol | 1-dodecanol | 0.295 |
| ethanol | 1,3-butanediol | 0.576 |
| 1-propanol | 1,3-butanediol | 0.585 |
| acetone | propylene glycol methyl ether acetate | 0.499 |
| methyl acetate | propylene glycol methyl ether acetate | 0.502 |
| benzonitrile | propylene glycol methyl ether acetate | 0.576 |
| 1,4-dioxane | propylene glycol methyl ether acetate | 0.677 |
| methanol | 1,2-propanediol | 0.517 |
| methanol | 1-(2-methoxy-1-methylethoxy)-2-propanol | 0.574 |
| methanol | 1-methoxy-2-propanol | 0.617 |
| methanol | glycerol | 0.722 |
| 2,2,4-trimethylpentane | D-limonene | 0.619 |
| hexane | D-limonene | 0.621 |
Possible new safe solvents in a new part of the chemical space: 4-Hydroxymethyl-1,3-dioxolan-2-one and Ethyl lactate.
References
[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073